


PROCESSING OF DATA 



The present invention relates to the processing of data, such as for example data which is 
represented in graphical form. 

One example of such a graphical form of data representation is a data model known as
Resource Description Framework (RDF), which represents data in the form of a
mathematical graph, that is to say a graph of nodes and directed arcs, and in doing so
illustrates any interrelationship of different data attributes. In accordance with the
terminology of the RDF data model, data is represented either as a Resource, a Property,
or a Value. One of the values of representing data graphically is that, in theory, it is
possible to allow data thus represented to convey semantic meaning, and for this reason
RDF is currently the leading candidate data model for providing the basis of a semantic
Worldwide Web. An example of the use of RDF is illustrated in Figs. 1 and 2, which
show a tabular representation of two conventional database entries, and the representation
of the data forming the entries of Fig. 1 in RDF respectively.

Referring now to Fig. 1, two records whose data it is desired to store in a database are
illustrated. Each record has three attributes: the publication number of a patent, the
inventor designated on the patent, and the author of the specification of the patent. As
can be seen from looking at the records, the inventor in each case is the same, and so to
this extent at least, the two records are interrelated.

Referring now to Fig. 2, both records, and their interrelationship, can be represented in a
graphical document format known as Resource Description Framework (RDF), and an
RDF document representative of the two records is shown in Fig. 2. The RDF document
may be thought of as a graphical representation of the data in Fig. 1, which also describes
the structure of that data, and contains essentially three elements: Resources, Properties
and Values. Thus for example, the document in Fig. 2 has a Resource #A1. This
Resource is labelled #A1, although in the event that the resource could be named by a
Uniform Resource Identifier (URI), such as for example a web page address, this would
also appear in the name of the Resource. In this example the resource has no such name,
but has four different properties which, inter alia, serve to characterise it: Patent No.,
Author, Inventor (all of which may intuitively be related to one of the records in Fig. 1),
and "rdf:type". The first three properties are simply the different attributes of one of the
records shown in Fig. 1, while the fourth indicates the type or nature of the Resource,
which in this instance is a patent. With this in mind it follows that a patent (which is the
"type" of the Resource) has the properties of Author, Inventor and Number, and while
this may not be the most intuitive way to describe a record in Fig. 1 from a lay person's
perspective, it nonetheless is possible to see that all of the information shown in a record
in Fig. 1 is replicated in this format. Thus the two Resources #A1 and #B1 relate to the
patents 5678 and 1234 respectively.

The properties of Inventor and Author for each of these two Resources are respectively
represented by further Resources: #B2, which corresponds to the inventor (a single
Resource, since the inventor is the same in each case); and #A2 and #C2, which
correspond to the two authors. The Resource #B2 is thus the Value of the Inventor
Property for each of the Resources #A1 and #B1, and itself has two further properties,
one of which is its rdf:type, indicating that the Inventor is a person, and the other is the
name of the inventor, which is its "literal" Value, the inventor's name A. Dingley. The
Author Properties of the Resources #A1 and #B1 are respectively the Resources #A2 and
#C2, and each has an rdf:type property which signifies that the Author is a person, and a
Name Property having a literal Value, these being the names of the Authors "Formaggio"
and "Cheeseman" respectively.






Thus an RDF document describes completely both the data in a record, its nature and any
interrelationship with data in another record. The purpose of representing data in such a
manner is essentially to provide a common format independent of the source format of
data, which may be manipulated by computers, and which contains all of the original
data.
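By way of illustration only, the two records of Fig. 1 can be written down as a list of (Resource, Property, Value) triples. The sketch below is a minimal Python rendering of the graph described for Fig. 2; the property spellings (`PatentNo`, `Name`) and the helper names `values` and `resources_with` are assumptions made for this example and do not appear in the figures.

```python
# Hypothetical rendering of the Fig. 2 graph as (resource, property, value) triples.
# Resource labels follow the description above; property spellings are assumed.
TRIPLES = [
    ("#A1", "rdf:type", "Patent"),
    ("#A1", "PatentNo", "5678"),
    ("#A1", "Inventor", "#B2"),
    ("#A1", "Author", "#A2"),
    ("#B1", "rdf:type", "Patent"),
    ("#B1", "PatentNo", "1234"),
    ("#B1", "Inventor", "#B2"),
    ("#B1", "Author", "#C2"),
    ("#B2", "rdf:type", "Person"),
    ("#B2", "Name", "A. Dingley"),
    ("#A2", "rdf:type", "Person"),
    ("#A2", "Name", "Formaggio"),
    ("#C2", "rdf:type", "Person"),
    ("#C2", "Name", "Cheeseman"),
]


def values(resource, prop):
    """Return every Value attached to `resource` by the Property `prop`."""
    return [v for (r, p, v) in TRIPLES if r == resource and p == prop]


def resources_with(prop, value):
    """Return every Resource whose Property `prop` has the given Value."""
    return sorted(r for (r, p, v) in TRIPLES if p == prop and v == value)
```

Because the shared inventor is a single Resource, asking which Resources have `#B2` as Inventor returns both patents, which is the interrelationship the graph form makes explicit.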
One of the problems associated with the graphical representation of data, which is a
problem well known per se, is that limitations of current mathematical theory
correspondingly limit the ability to process the data. For example, it is not within the
scope of current mathematical theory to provide, analytically, a rigorous topographical
comparison between two graphs. Such limitations in turn limit the extent to which
graphically represented data can achieve the aims of providing the basis of, for example,
a semantic worldwide web.

A first aspect of the present invention seeks, inter alia, to ameliorate this problem, and
provides a generally applicable method of processing data, which is applicable to the
processing of graphically represented data. According to a first aspect of the invention
there is provided a method of processing data (typically, but not necessarily, graphical
data) according to which data is processed in accordance with a first set of rules, which
operate, inter alia, to define a stage at which such a processing operation ceases, and then
applying to the partly-processed data a second set of rules, which operate to modify the
data so that the data thus modified is then processable further by applying a third set of
rules. In a preferred embodiment, modified data is then processed by the third set of
rules.

Although the data which it is desired to process has itself been changed, because this
change has taken place on the basis of a defined set of rules the outcome of these
operations, and in particular the manner in which the outcome of the processing differs
from an ideal (in which the unmodified data is processed completely), will be well
understood. Different sets of data processed by applying this method may thus be
compared, combined or otherwise used in conjunction with each other on this basis
provided consistent rules are applied to their processing. In one preferred embodiment,
the first and third sets of rules are similar, and in a further embodiment they are the same.

Preferably the method outlined above is preceded by the deterministic modification of the
data prior to applying the first set of rules. Deterministic modifications may, in layman's
terms, be thought of as modifications which are not significant. Thus by analogy, the
colour of the paper on which a patent specification is printed is not significant vis-à-vis
its legal effect, and may thus be thought of as being insignificant. Conversely, non-
deterministic modifications are modifications which are significant.

Preferably the modifications of the data following processing in accordance with a first
set of rules are non-deterministic modifications (i.e. significant), but may be "labelled" as
insignificant.

A more detailed description of an embodiment of the present invention will now be
provided, firstly in a prototype implementation of the method for UNIX, shown below,
and secondly in the accompanying proposal for a conference paper entitled "RDF
Canonicalization: A Cheater's Guide".
APPENDIX - Unix and GNU awk (version 3.0.3) partial implementation

= README

This directory contains a prototype implementation
of the techniques described in RDF Canonicalization:
A Cheater's Guide by Jeremy J. Carroll.

The implementation requires Ntriple files as input
and outputs Ntriple files in canonical RDF.

The treatment of space in literals does not
conform with the lexicographic order in the paper;
spaces are sorted as the character sequence "\u0020".

The files are:
aline - A file with two or more blank lines.
c14n.awk - An awk script implementing steps 7 and 8
           from the one step deterministic labelling.
escape-space.sed
         - a sed file for removing spaces from literals
           in Ntriple files. (They are replaced with
           the illegal escape sequence \u0020)
multipass.awk - An awk library to allow multiple passes
           over the datastream in a single awk process.
multistep.awk - An awk script that initializes a one or multi
           step deterministic labelling, using the
           c14n.awk and multipass.awk libraries.
multistep.sh - A shell script that invokes the sed and awk
           scripts appropriately - start here.
multistepdelete.sh - A shell script that turns arbitrary
           Ntriples into canonical RDF by deleting
           the unlabeled nodes after a multistep
           labelling.


40 = aline 



PDNO 200300135 



15/11 '02 17: 01 FAX __, 



UK PO 



I3D06 




= c14n.awk

# This file defines the following passes that
# can be used with the multipass.awk library.
# "step7" in one-step deterministic labelling.
# "step8" in the one-step deterministic labelling.

{ dType = 0
  dSame = 0
}
$1 == "~" { dType += 1 }
$3 == "~" { dType += 2 }
PASS == "step4" && $1 ~ /^_:/ { $(NF+1) = "#"
        $(NF+1) = $1
        $1 = "~"
}
PASS == "step5" && $3 ~ /^_:/ { $(NF+1) = "#"
        $(NF+1) = $3
        $3 = "~"
}
PASS == "step4" { print > tmp }
PASS == "step5" { print > tmp }
PASS == "deleteSome" && ( $1 == "~" || $3 == "~" ) { print > tmp }
PASS == "step7" && FNR != 1 && dType != 0 {
    dSame = dLastType != 0 && ( dLast[1] == $1 && dLast[2] == $2 && dLast[3] == $3 )
}
PASS == "step7" && FNR != 1 && dLastType != 0 {
    if ( (!dSame) && (!dLastSame) ) {
        numberVars()
    }
}
PASS == "step7" {
    printLast()
    dLastSame = dSame
    dLastType = dType
    dLastLength = split($0, dLast)
}
PASS == "step8" { useNumberVars(); print > tmp }
AFTERPASS == "step7" {
    if ( dLastType != 0 && (!dLastSame) ) {
        numberVars()
    }
    printLast()
    dLastLength = 0
}

function printLast( di)
{
    if ( dLastLength != 0 ) {
        for ( di = 1 ; di < dLastLength ; di++ ) {
            printf "%s ", dLast[di] > tmp
        }
        printf "%s\n", dLast[di] > tmp
    }
}

function numberVars() {
    numberVar(3)
    numberVar(1)
}
function numberVar(ix) {
    if ( dLast[ix] == "~" ) {
        name = dLast[dLastLength]
        dLastLength--
        if ( !dTable[name] ) {
            dTable[name] = sprintf("%6.6d", counter++)
        }
        dLast[ix] = dTable[name]
        if ( dLast[dLastLength] != "#" ) {
            print "Shouldn't happen" > "/dev/stderr"
        } else {
            dLastLength--
        }
    }
}

function useNumberVars() {
    useNumberVar(3,NF)
    # [one or more lines illegible in the source listing]
}

function useNumberVar(ix,pos) {
    # [one or more lines illegible in the source listing]
        if ( !dTable[$pos] ) {
            return
        }
        $ix = "_:g" dTable[$pos]
    # [one or more lines illegible in the source listing]
    } else {
        $pos = $(pos+2)
        NF -= 2
    }
    # [remaining closing braces illegible in the source listing]
}

= escape-space.sed

:retry

5 #sAa^\] ,, \a A WlI\U)n) Al\\u0020/ 

tretry



= fmtx

#!/bin/csh
foreach i ( $* )
echo "

echo "
pr -t -n $i
end


= multipass.awk

# This is a library that allows the use of awk in
# multipass mode over stdin.
# use addPass("foo") to add an extra pass over
# the input. The pass is named "foo".
# To print output for use in the next pass
# use print > tmp
# To access the name of the current pass
# use the variable PASS
# A special pass called "sort" may be added,
# this sorts the data in lexicographic sort order

BEGIN {
    tmp = "tmpA"
    otherTmp = "tmpB"
    ARGC = 1
}

{ if ( DEBUG ) print FILENAME, ARGIND, passNames[ARGIND], $0 > "/dev/stderr" }
{ AFTERPASS = 0
  if ( FNR == 1 && ARGIND % 2 == 0 ) {
      AFTERPASS = PASS
  }
}
{ if ( FNR == 2 && ARGIND % 2 == 0 ) {
      if ( passNames[ARGIND-1] == "sort" ) close(sort)
      else close(tmp)
      x = tmp
      tmp = otherTmp
      otherTmp = x
      nextfile
  }
}
{ if ( FNR == 1 ) {
      PASS = passNames[ARGIND]
      if ( ARGC == ARGIND+2 ) tmp = "/dev/stdout"
      if ( PASS == "sort" ) {
          if ( ARGC == ARGIND+2 ) sort = "sort"
          else sort = "sort > " tmp
      }
  }
}

PASS == "sort" { print | sort }

function addPass(passName, i) {
    i = (ARGC+1)/2
    passNames[ARGC] = passName
    ARGV[ARGC] = ( i % 2 == 0 ? "tmpA" : "tmpB" )
    if ( ARGC == 1 ) ARGV[1] = "/dev/stdin"
    ARGV[ARGC+1] = "aline"
    ARGC += 2
}


= multistep.awk

# This implements the multistep and the onestep
# deterministic labelling algorithms from the
# paper RDF Canonicalization: A Cheater's Guide
# by J. Carroll.

# The command line should be:
# awk -f multipass.awk -f c14n.awk -f multistep.awk [NN] <ntriples.nt
# [NN] is the desired number of steps (default one)
# ntriples.nt is the RDF graph as an ntriples file as input.
# Note spaces in literals are not supported.

BEGIN {
    if (ARGC == 0) STEPS = 1
    else STEPS = ARGV[ARGC]
    addPass("step4")
    addPass("step5")
    addPass("deleteSome")
    addPass("sort") # step6
    for (i = 0; i < STEPS; i++) {
        addPass("step7")
        addPass("step8")
        addPass("sort")
    }
}



= multistep.sh

#!/bin/sh
export LC_ALL=C
sed -f escape-space.sed | awk -f multipass.awk -f c14n.awk -f multistep.awk $1



= multistepdelete.sh

#!/bin/sh
export LC_ALL=C
./multistep.sh $1 | sed -e '/~/d'
ABSTRACT

Digital signatures require canonical forms for the documents signed.
Given the importance of both the Resource
Description Framework (RDF) and digital signatures in semantic
web technology, it is highly desirable to have polynomial time
canonicalization algorithms for RDF. This paper shows
that RDF canonicalization is Graph Isomorphism complete, and
hence that, probably, no such polynomial time algorithm exists.
However, a practical solution that allows the canonicalization
of any RDF graph in subquadratic time is demonstrated. This
involves the identification of the difficult blank nodes, and
then adding extra triples: slightly changing the RDF graph
into a straightforwardly canonicalizable one.

Categories and Subject Descriptors

G.2.2 [Graph Theory]: Graph algorithms, Graph labeling. F.2.2
[Nonnumerical Algorithms and Problems]: Pattern matching,
Complexity of proof procedures, Computations on discrete
structures, Sorting and searching; I.7.2 [Document Preparation]:
Markup languages, Standards

General Terms

Algorithms, Standardization

Keywords

Digital signatures, XML canonicalization, OWL, graph
isomorphism.

1. INTRODUCTION

This paper concerns techniques to canonicalize RDF graphs. In
other words, a method of choosing one of the very many different
ways of writing down an RDF graph.

In many technical areas we find equivalences between things. This
results in two superficially different objects being regarded as the
same. Canonicalization is an important technique to side step
some of the less tractable consequences of the superficial
differences.

Two types of canonicalization appearing in web technologies are
the canonical lexical representatives of specific datatype values in
XML Schema Datatypes [6], and the canonicalization of XML
documents and document subsets [7].

Canonicalization within web technologies is largely concerned
with lexical considerations.

One of the crucial applications that XML Canonicalization [7] is
intended to service is that of digital signatures. Digital signatures
are applied to sequences of bytes, and can be used to detect
changes in such. Without a canonical form it is necessary to store
the original XML document along with the representation used
during further XML processing. A further advantage that the
XML Canonicalization recommendations bring is the ability to
express a document subset as a well-defined byte sequence that
can be signed, independently of the full document.



1.1 Semantic Web Architecture

[Figure 1: Berners-Lee's Architecture [3], annotated with the positions of
RDF Canonicalization and XML Canonicalization]

1 Jeremy Carroll is a visiting researcher at ISTI, CNR, Pisa.

For the semantic web [5] and semantic web services [14],
XML documents serve merely as a transport for a deeper
meaning that is expressed in terms of RDF graphs. Digital
signatures play a crucial role in enabling the trust layer to be built
on top of the lower layers. In Figure 1, Berners-Lee's well-known
architecture picture is annotated to show the position both of the
XML Canonicalization recommendations, and the work of this
paper.

Canonicalizing the semantic layers of the web at the level of the
XML documents would be as misplaced as canonicalizing XML
documents only at the level of character encoding.

1.2 XML C14N

In XML there are a number of arbitrary choices made which are
not generally seen as contributing to the 'meaning' of the
document. While it is unclear quite what that meaning might be,
none of the XML family of recommendations see certain aspects
of an XML document as significant. These include the choice of
character encoding; the choice between single and double quotes
for attribute values; and the choice of white space within element
tags. Canonical XML [7] determines preferred options for such
choices, and, in fact, for all aspects of the document that are
insignificant in the XPath nodeset [11].

These choices may change how the document looks in a simple
text editor.

XML canonicalization is the process by which an XML document
is converted into another one, which is in canonical XML, but is
otherwise identical to the original.

1.3 Canonical Ordering for RDF

In XML, the document ordering is largely preserved during
canonicalization. Whenever two elements of the XML
Information set appear in the canonicalized document, they appear
in the same relative order as they were found in the
precanonicalized form.

RDF does not have a notion corresponding to XML's document
ordering. The RDF abstract syntax [18] is defined as an
(unordered) set of triples.

Any specific document containing a serialization of an RDF graph
will however have imposed some total order on those triples.

Thus any canonicalization of RDF involves a canonical ordering
of the triples in the RDF graph. Almost all of this paper deals with
that part of the problem; the details of how, given such an
ordering, the file can actually be written are sketched quickly.

1.4 Difficulties with RDF Canonicalization

The fundamental problem is that RDF Graph canonicalization can
be shown to be equivalent to the Graph Isomorphism problem.
The complexity (GI) of this latter problem is well-researched [20]
and is conjectured to be strictly harder than polynomial time, and
strictly easier than non-deterministic polynomial time, i.e.

P < GI < NP

Such a high complexity is unacceptable in an infrastructural
component.

1.5 Meaningless Changes

The key insight of this paper is that while in general RDF Graph
canonicalization is GI complete, all interesting RDF graphs can be
slightly modified (typically by adding a few, explicitly
meaningless, arcs) to be in a class of RDF graphs which can be
much more easily canonicalized (O(n log n)). The reader should
judge whether these modifications are merely a dirty hack or an
elegant engineering compromise.

An alternative engineering approach of simply deleting
problematic parts of the graph is also explored.

2. A SIMPLE EXAMPLE 

Consider an RDF Graph with two Triples. It can be represented in
N-Triples [16] as a file:

# Here is a graph
_:aBlankNode <eg:prop> [...]

It can also be represented as:

_:ax <eg:prop> [...]

The two files are different, but the graph they describe is the
same. The differences between the files are insignificant. These
include:

• The comments.
• The whitespace.
• The order of the lines.
• The choice of blank node identifiers (e.g. _:aBlankNode or _:ax)

Other aspects of the N-Triples are significant in that they
represent the intended abstract RDF Graph [18]. Significant
aspects include:

• The presence or absence of a triple
• The string in each literal
• The URIs for each property or resource

Canonical RDF is chosen so that there is at most one
representation of that graph in Canonical RDF. Thus, we can
check that these two are the same by converting both into
canonical RDF and doing a character-by-character comparison.

Of the insignificant differences highlighted, only two present any
interesting difficulties: the order of the lines, and the blank node
identifiers.



3. APPLICATIONS OF RDF C14N

The techniques described in this paper can be used to enable
various applications over semantic web technologies. These
correspond to the applications enabled by XML Canonicalization.
There are two applications suggested in the
recommendation:

• testing whether the information content of an XML
  document or document subset has been changed
• digital signature generation over the canonical form of an
  XML document

Lynch [21] argues that canonicalization is a "Fundamental Tool to
Facilitate Preservation and Management of Digital Information".
The scenario he discusses is the need to reformat a digital object
which is being preserved in an electronic library. The need for
reformatting occurs as the software and hardware required to view
the original object become obsolete.

Lynch uses (lossy) canonicalization as a way of defining the
essence of a document. This essence, if sufficient unimportant
details are discarded, will be invariant with reformatting. Thus we
can compute this essence of the original document, and the
original author can sign this essence to vouch for its authenticity.
Later, after the death of the author, and a sequence of many
transformations of the original byte sequence, the essence of the
document is unchanged, and we can still verify that the author
signed it.

Within the semantic web, a particularly important application of
RDF canonicalization is likely to be the signing of OWL
ontologies [13].

4. RDF C14N IS GI COMPLETE

The techniques of this paper cannot canonicalize arbitrary RDF
graphs without making some modifications to them. This reflects
the underlying difficulty of RDF canonicalization.

As discussed by Carroll [...], RDF graph equality and the graph
isomorphism problem have equivalent complexity.

Any unlabelled undirected graph can be encoded in RDF by
replacing each node with a blank node, and each edge with
two arcs (in each direction), always using a single property label.
E.g. a simple triangle (three nodes, three arcs) can be
encoded in N-Triples as:

_:a <eg:x> _:b .
_:b <eg:x> _:a .
_:b <eg:x> _:c .
_:c <eg:x> _:b .
_:c <eg:x> _:a .
_:a <eg:x> _:c .

If we could solve RDF canonicalization in polynomial time, then
we could compare two RDF graphs for equality in polynomial
time (by comparing their canonical representations). This would
give a polynomial time solution to the graph
isomorphism problem [15], [20].
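The encoding used in this reduction is mechanical, and a short sketch may make it concrete. The Python below is illustrative only: the function name `encode_undirected_graph` and the property `<eg:x>` are assumptions chosen for the example, not part of the paper.

```python
def encode_undirected_graph(edges, prop="<eg:x>"):
    """Encode an unlabelled undirected graph as N-Triples lines.

    Each node becomes a blank node and each edge becomes two arcs, one in
    each direction, all using the same (assumed) property `prop`.
    """
    lines = []
    for u, v in edges:
        lines.append(f"_:{u} {prop} _:{v} .")
        lines.append(f"_:{v} {prop} _:{u} .")
    return lines


# A triangle: three nodes, three undirected edges, six triples.
triangle = encode_undirected_graph([("a", "b"), ("b", "c"), ("c", "a")])
```

Two undirected graphs are isomorphic exactly when their encodings describe the same RDF graph, which is why a fast canonicalizer for arbitrary RDF would also decide graph isomorphism.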

4.1 What's the difficulty?

The graph isomorphism problem is deceptive. It looks easy.
Here we will informally explore the problem of
representation of a simple graph, with two unconnected
components shown in figures 2 and 3.

[Figure 2: A 6 vertex graph]

[Figure 3: A different 6 vertex graph: K3,3]

The graph consists of 12 nodes and 18 edges. Each node has three
neighbours, and two nodes at a distance of two away. Each node
is in a connected component of six nodes.

If we try and canonicalize this, we will start by writing some node
before all the other nodes. That node can either come from the
component in figure 2 or that in figure 3. The canonicalization
algorithm needs to make a deterministic choice. There is no
very obvious rationale for choosing one or the other, and
whichever choice we make seems to require us to think not
just about the node, but also its neighbours, and their neighbours,
and ...

A good solution to this problem is provided by McKay [...].
He uses an analysis of the automorphism group of a graph to
choose canonical representatives. His solution is of non-
polynomial complexity and complicated to program. It is not
appropriate for an infrastructural component within the
semantic web.

We suggest other techniques, described below, which when
faced with the rather unlikely RDF that encodes such an
unlabelled undirected graph, will quickly work out that there is a
problem. One of the methods canonicalizes that part of the
graph which is not problematic; this method will delete all of the
edges in this example. The other method will change the example
by adding enough labelled nodes and directed edges to
make it possible to choose which node goes first based on
lexicographic orderings of the labels.

These methods are successful for RDF because real RDF is
not as difficult as these examples. Hence, the rather arbitrary and
unwelcome aspects of both methods, in practice only get used a
little. For example, rather than the drastic change of deleting
all the edges, in practice, that method deletes only a small
percentage of the edges.

5. PRELIMINARIES

5.1 Syntactic

For clarity of exposition, the techniques are
based around the lexicographic sorting of N-Triples [16] files. N-
Triples is an ASCII format.

We use numbered gensym identifiers during the algorithm. These
are written with just sufficient leading zeros so that lexicographic
ordering and integer ordering are the same. (The number of
leading zeros needed is computed from the number of blank
nodes in the RDF graph being considered).
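The zero-padding rule can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the helper name `gensym_labels` is hypothetical, the `_:g` prefix is the one used later in the labelling algorithm, and numbering is assumed to start at 1.

```python
def gensym_labels(blank_node_count):
    """Return gensym identifiers padded so that string order equals integer order."""
    # Width of the largest counter value that can be issued.
    width = len(str(blank_node_count))
    return [f"_:g{n:0{width}d}" for n in range(1, blank_node_count + 1)]


labels = gensym_labels(12)
```

Without the padding, `_:g10` would sort before `_:g2`, so a plain lexicographic sort of the file would no longer agree with the order in which the labels were issued.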

5.2 Semantic

The techniques in this paper rely on being able to make
meaningless changes to an RDF graph. This is done in accordance
with the RDF formal semantics [17], by using a special property,
which we conventionally refer to as c14n:true, defined here:

<rdf:Property rdf:about="&c14n;true">
<rdfs:description>
This property is true whatever resource is
its subject, and whatever literal is its
object.
Thus triples with literal objects, and
c14n:true as predicate, can arbitrarily be
added to and deleted from an RDF graph
without changing its meaning.
</rdfs:description>
</rdf:Property>

The entity reference &c14n; is bound to the URL
http://www-uk.hpl.hp.com/people/jjc/rdf/c14n#

The entity declaration and the declarations of the rdf and rdfs
namespaces have been omitted.

By specifying this predicate as being always true, adding or
deleting triples with this predicate does not alter the entailments
under the RDF model theory. Formally the semantics of the
document are unchanged.
5.3 N-Triples to Canonical XML

In this paper, we concentrate on creating canonical
representations of RDF graphs in N-Triples [16].
However, for full integration with tools built on Canonical XML
[7] it is necessary to transform these files into XML ones;
moreover, we wish the XML produced to canonically depend on
the original RDF graph.

Conventionally we use rdf as the prefix for the RDF namespace
declared on the root element of the document.
For simplicity we will turn each triple in an N-Triples document
into an rdf:Description element, which contains a single
property element.

If the subject of the triple is a URI reference, then we use an
rdf:about attribute on the rdf:Description element,
otherwise the subject is a blank node and we use an
rdf:nodeID attribute. We use the blank node identifier from
the N-triple as the blank node identifier in the RDF/XML.
The predicate of the triple is expressed using the default
namespace on the property element, with the predicate URIref
being split at the leftmost legal point, i.e. the local name is as long
as possible. (Some predicates have no legal split point and such
graphs cannot be serialized in RDF/XML, see [2]).

If the object of the triple is a URI reference, then we use an
rdf:resource attribute on the property element. If the object
is a blank node, we use an rdf:nodeID attribute. If the object is
an XML Literal we use rdf:parseType='Literal'. If
the object is a typed literal we use an rdf:datatype attribute.
For all literals we put the lexical form into the element content.
We use a newline after the opening <rdf:RDF> and after
each property element end tag. Otherwise we only use whitespace
within element tags and as specified by literal lexical forms.

The triples are converted into XML in order.
The resulting XML file is then canonicalized [7].

6. RDF C14N WITHOUT BLANK NODES

To create a canonical N-Triples file for an RDF graph without any
blank nodes we use the following algorithm:

1. Canonicalize each XML Literal [18] in the graph using XML
   canonicalization [7].

2. For each typed literal in the graph canonicalize² it according
   to the rules in XML Schema datatypes [6]. That is, given a
   typed literal <datatypeURI, lexicalForm> replace
   it with <datatypeURI, lexicalForm'> where
   lexicalForm' is the canonical form of lexicalForm
   according to the datatype specified by datatypeURI.

3. Write the graph as an N-Triples document [16]. Each line of
   the document is a complete distinct triple of the graph.

4. Reorder the lines in the N-Triples document to be in
   lexicographic order. (This could be implemented simply with
   Unix™ sort³).

² Datatypes that are not XML Schema built-in types are not
  supported in this version.
³ sort in Unix uses a locale dependent ordering. To use US-
  ASCII order, it is necessary to set the environment variable
  LC_ALL=C.
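Steps 3 and 4 are simple enough to sketch directly. The Python below is a minimal illustration, not the prototype's implementation: it assumes steps 1 and 2 have already been applied, that the graph has no blank nodes, and that no literal contains consecutive whitespace (the naive whitespace normalisation would otherwise alter it). The names `canonical_ntriples` and `digest` are hypothetical.

```python
import hashlib


def canonical_ntriples(lines):
    """Return distinct triples, one per line, in US-ASCII lexicographic order."""
    triples = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines are insignificant
        # Naive normalisation: a single space between terms (assumption above).
        triples.add(" ".join(line.split()))
    # Sorting the encoded bytes mirrors `sort` run with LC_ALL=C.
    ordered = sorted(triples, key=lambda t: t.encode("ascii"))
    return "".join(t + "\n" for t in ordered)


def digest(lines):
    """Hash the canonical form, as a stand-in for the input to a signature."""
    return hashlib.sha256(canonical_ntriples(lines).encode("ascii")).hexdigest()


file_a = ["# Here is a graph", '<eg:a> <eg:prop> "1" .', "<eg:b>   <eg:prop> <eg:a> ."]
file_b = ["<eg:b> <eg:prop> <eg:a> .", '<eg:a> <eg:prop> "1" .']
```

The two inputs differ in comments, whitespace and line order, yet they produce the same canonical text and therefore the same digest, which is the property a signature over the canonical form relies on.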



7. LABELLING BLANK NODES

If we have blank nodes in the graph then life is somewhat trickier.

In N-Triples blank nodes are represented using blank node
identifiers, which can appear in subject or object position.

Unfortunately, these identifiers are gensyms created during the
writing of the N-Triples, and are not an intrinsic part of the graph.
Hence, it is an error if the canonicalization depends on these
gensyms. In contrast, the canonicalization algorithm must
deterministically choose new blank node identifiers.
In this part of our approach, we first write out the file using
arbitrarily chosen blank node identifiers; then we sort the
document (mostly) ignoring those identifiers. On the basis of this
sorted document, we then rename all the blank nodes, in a
(hopefully) deterministic fashion.

Since the level of determinism is crucial to the workings of the
canonicalization algorithm, we start by defining a deterministic
blank node labelling algorithm. This suffers from the defect of not
necessarily labelling all the blank nodes.

Deterministic, in this context, means dependent only upon
significant parts of the initial representation of the graph, and,
nondeterministic means dependent in part, upon insignificant
parts of the initial representation of the graph.

7-1 One-step Deterministic Labelling 

In RDF [18], a single triple can comam isero, one or two blank : 
node? (the property must be specified by a URI 14J). 

Wc present an algorithm thai labels same blank nodes on the basiB 
of the immediate neighbors of that blank node (the one-step in the 

•name of thft algorithm^). 

The algorithm from section 0 ifl modified: 

1. Canonicalize each XML Literal [18] in the graph using XML
canonicalization [7].

2. For each typed literal in the graph canonicalize it according
to the rules in XML Schema datatypes [6].

3. Write the graph as an N-Triples document. Each line of the
document is a complete distinct triple of the graph.

4. For each line with a blank node identifier in subject position
(e.g. _:subj), replace the blank node identifier with "~"
and add a comment "# _:subj" to the end of the line,
indicating the original identifier.

5. For each line with a blank node identifier in object position
(e.g. _:obj), replace the blank node identifier with "~" and
add a comment "# _:obj" to the end of the line,
indicating the original identifier.

6. Reorder the lines in the N-Triples document to be in
lexicographic order. (This could be implemented with
Unix™ sort).

7. Use a gensym counter, initialized to 1, and a lookup table,
initially empty. Go through the file from top to bottom:

a. If this line is the same as the next or previous line
excluding any trailing comment, continue to the next
line in the file.

b. If there is a "~" at object position:

i. Extract the blank node identifier from the final
comment in the line. Remove the comment.
ii. Look the identifier up in the table.
iii. If there is no entry, insert a new entry formed from
"_:g" concatenated with the current gensym
counter value. Increment the counter.
iv. Replace the "~" with the value from the table.

c. If there is a "~" in subject position, use the same
subprocedure to replace it with a consistently chosen
gensym.

8. Using the same lookup table, go through the file from top to
bottom:

a. If there is a "~" in object position:

i. Extract the blank node identifier from the final
comment in the line. Remove the comment.
ii. Look the identifier up in the table.
iii. If there is an entry, replace the "~" with the
value from the table.

b. If there is a "~" in subject position, use the same
subprocedure to possibly replace it with a consistently
chosen gensym.

9. Lexicographically sort the N-Triples again.
The only non-determinism here is that the sorts depend on the
blank node labels in pairs of triples for which the rest of the triple
is identical. Such pairs are studiously avoided in the assignment
of labels, and so the order in which such pairs appear does not
affect the labels chosen. Note that step 8, which does deal with
the incomparable lines, does not choose any labels and is hence
deterministic.

The algorithm will deterministically label some of the blank
nodes; for others it leaves them unlabelled. Those nodes are
referred to as hard to label nodes.
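Steps 3 to 9 above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: triples are assumed to arrive as 3-tuples of N-Triples tokens, literals are assumed to be already canonicalized (steps 1 and 2 are omitted), and the names one_step_label, first_label and copies are hypothetical. The input is modelled on example 2; the triple with subject <eg:b> and object _:x is an assumed input line.

```python
from collections import Counter


def is_blank(term):
    return term.startswith("_:")


def one_step_label(triples, first_label=1):
    """Apply steps 3 to 9 and return (lines, table).

    Nodes that could not be labelled stay as '~', with their original
    identifier kept in a trailing comment.
    """
    # Steps 3-5: one row per distinct triple, blank nodes masked as '~'.
    rows = [{"terms": ["~" if is_blank(t) else t for t in triple],
             "ids": list(triple)}
            for triple in set(triples)]

    def text(row):
        return " ".join(row["terms"]) + " ."

    rows.sort(key=text)                              # step 6
    copies = Counter(text(row) for row in rows)
    table, counter = {}, first_label

    # Step 7: only lines that differ from their neighbours choose labels.
    for row in rows:
        if copies[text(row)] > 1:
            continue
        for pos in (2, 0):                           # object, then subject
            if row["terms"][pos] == "~":
                ident = row["ids"][pos]
                if ident not in table:
                    table[ident] = "_:g%d" % counter
                    counter += 1
                row["terms"][pos] = table[ident]

    # Step 8: reuse existing table entries; no new labels are chosen.
    for row in rows:
        for pos in (0, 2):
            if row["terms"][pos] == "~" and row["ids"][pos] in table:
                row["terms"][pos] = table[row["ids"][pos]]

    rows.sort(key=text)                              # step 9
    lines = []
    for row in rows:
        left = [row["ids"][p] for p in (0, 2) if row["terms"][p] == "~"]
        lines.append(text(row) + (" # " + " ".join(left) if left else ""))
    return lines, table


example = [
    ("<eg:b>", "<eg:prop>", "_:x"),      # assumed input line
    ("_:a", "<eg:zee>", '"why"'),
    ("_:a", "<eg:prop>", '"val"'),
    ("_:b3", "<eg:prop>", '"val"'),
]
lines, table = one_step_label(example)
for line in lines:
    print(line)
```

Because equal lines are adjacent after sorting, counting copies of each masked line is used here in place of comparing each line with its neighbours. Any line in the result that still starts with or contains "~" belongs to a hard to label node.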

The operation of the deterministic labeling algorithm depends on
triples that can distinguish one blank node from another. These
distinctive triples are characterized by being unique in the graph
even when all blank nodes are treated as identical.

The hard to label nodes do not participate in any distinctive
triples.
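The definition of a distinctive triple can be checked directly. The sketch below assumes triples are 3-tuples of N-Triples tokens; the helper names mask, distinctive_triples and hard_to_label are hypothetical, and the sample graph uses the three triples listed for example 2.

```python
from collections import Counter


def mask(triple):
    """Treat all blank nodes as identical by replacing each with '~'."""
    return tuple("~" if term.startswith("_:") else term for term in triple)


def distinctive_triples(triples):
    """Triples that remain unique in the graph after masking blank nodes."""
    distinct = set(triples)
    copies = Counter(mask(t) for t in distinct)
    return {t for t in distinct if copies[mask(t)] == 1}


def hard_to_label(triples):
    """Blank nodes that take part in no distinctive triple."""
    def blanks(ts):
        return {term for t in ts for term in (t[0], t[2])
                if term.startswith("_:")}
    return blanks(triples) - blanks(distinctive_triples(triples))


graph = [
    ("_:a", "<eg:zee>", '"why"'),
    ("_:a", "<eg:prop>", '"val"'),
    ("_:b3", "<eg:prop>", '"val"'),
]
print(sorted(hard_to_label(graph)))
```

Only the <eg:zee> triple is distinctive here, so _:a can be labelled while _:b3 cannot.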

Example 1 is fully labelled by this algorithm:
<eg:a> <eg:foo> _:a .

<eg:b> <eg:prop> _:b .
are transformed into these:

531 <eg! props- »traA" - 

On the other hand, example 2 is not fully labelled:

_:a <eg:zee> "why" .
_:a <eg:prop> "val" .

_:b3 <eg:prop> "val" .

are transformed into these:

<eg:b> <eg:prop> _:g1 .

~ <eg:prop> "val" . # _:b3



8. CANONICAL RDF

The algorithm above has the desired property of being of a low
complexity class, and hence optimizable to be sufficiently fast.

Thus we will define canonical RDF on the basis of this algorithm.

A canonical N-Triples document is one (without any comments)
which is unchanged under the application of the one-step
deterministic labeling algorithm.

That is, canonical RDF in N-Triples has the following features:

• There are no hard to label nodes.

• Every blank node identifier has the form gNNN where NNN
is some number of digits. The number of the digits is the
same for every blank node identifier. At least one identifier
has a non-zero first digit.

• After deleting all triples that are not distinctive, the first
occurrences of each blank node identifier are in numeric
order, starting at 1, without gaps.

• The file is in lexicographic sort order.

Such files are unchanged under the one-step deterministic
labeling.
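Some of the listed features lend themselves to a mechanical check. The following sketch is an assumption-laden illustration: it expects comment-free N-Triples lines, the function name canonical_feature_report is hypothetical, and it deliberately omits the feature concerning the order of first occurrences, which requires the distinctive triples to be identified first.

```python
import re


def canonical_feature_report(lines):
    """Check some of the listed features on comment-free N-Triples lines."""
    nodes, placeholder_free = set(), True
    for line in lines:
        subject, _predicate, obj = line.strip()[:-1].strip().split(None, 2)
        for term in (subject, obj):
            if term == "~":
                placeholder_free = False
            elif term.startswith("_:"):
                nodes.add(term[2:])
    well_formed = all(re.fullmatch(r"g[0-9]+", n) for n in nodes)
    return {
        "no hard to label nodes": placeholder_free,
        "identifiers have the form gNNN": well_formed,
        "same number of digits": len({len(n) for n in nodes}) <= 1,
        "some non-zero first digit": well_formed and (
            not nodes or any(n[1] != "0" for n in nodes)),
        "lexicographic order": list(lines) == sorted(lines),
    }


document = [
    '<eg:b> <eg:prop> _:g1 .',
    '_:g2 <eg:zee> "why" .',
]
report = canonical_feature_report(document)
for feature, holds in report.items():
    print(feature, holds)
```

A document passing these checks is not thereby shown to be canonical; the report only rules out the more obvious failures.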

If one-step deterministic labeling successfully labels all the nodes,
then the resulting output will be canonical RDF. In particular,
one-step deterministic labeling is idempotent. So the file
in example 1 above is in canonical RDF.

9. THE CHEATER'S GUIDE

The one-step deterministic labelling algorithm is sufficiently
quick, being O(n log n), and the goal of the rest of the paper is to
make this algorithm useable in all cases. Rather than
looking for an algorithm that completely solves the RDF graph
canonicalization problem, we will modify the problem to be
soluble by the algorithm: we will ensure that there are no hard to
label nodes. We show two ways of modifying the graph to make a
deterministic labelling algorithm work.

This process of making significant and/or nondeterministic
modifications to the input in order to make it easier to
canonicalize is not acceptable in a true canonicalization
algorithm. Thus we refer to it as 'cheating'; we try to cheat only a
little bit, and to be informed by the application needs as to what
sort of cheating we use.

9.1 Deterministic Cheating

The simpler approach is to delete all hard to label nodes from the
graph. This involves deleting all the triples in which such nodes
participate.

None of the deleted triples will be distinctive; if they
were, then the blank nodes in them would have been labeled.

The modified algorithm is:

A. Perform the one-step deterministic labeling.

B. Delete all lines containing a "~" in subject or object
position.

We note that this procedure is deterministic and idempotent.
Deleting unlabelled nodes appears drastic, but as long as there are
few (e.g. zero) of these nodes, it is a good practical solution.
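Step B amounts to a filter over the labelled file. A minimal sketch, assuming the line layout used in section 7.1 (the placeholder "~" and a trailing "# ..." comment), literals that do not themselves contain " # ", and hypothetical function names; the sample lines are modelled on example 2:

```python
def has_placeholder(line):
    """True if the line still has '~' in subject or object position."""
    body = line.split(" # ", 1)[0].strip()       # drop any trailing comment
    subject, _predicate, obj = body[:-1].strip().split(None, 2)
    return subject == "~" or obj == "~"


def delete_hard_to_label(labelled_lines):
    """Step B: keep only the lines in which every blank node was labelled."""
    return [line for line in labelled_lines if not has_placeholder(line)]


labelled = [
    '<eg:b> <eg:prop> _:g1 .',
    '_:g2 <eg:zee> "why" .',
    '~ <eg:prop> "val" . # _:b3',
]
kept = delete_hard_to_label(labelled)
for line in kept:
    print(line)
```

Applying the filter a second time leaves the result unchanged, which reflects the idempotence noted above.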

Continuing example 2 we have:
<eg:b> <eg:prop> _:g1 .

_:g2 <eg:zee> "why" .
after the application of this algorithm.

For applications such as Lynch's desire to capture the essence of a
digital object despite reformatting this is a good fit [21]. This
allows verifying the long-term integrity of digital deposits, where
the entire meaning does not need to be canonicalized, as long as
the greater part is captured in the canonicalization.

9.2 Non-deterministic Cheating 

Other applications of the use of digital signature require that the
meaning of the document is fully captured in the object that is
signed. A (somewhat tired) example would be signing a purchase
order. Hence the deterministic approach above is unsatisfactory,
since it quite happily changes the meaning of a document.

The second approach to cheating adds distinctive triples in order
to make sure that none of the nodes are hard to label.
The version described here uses many passes of the file - the three
important passes:

• identify the hard to label nodes;

• create additional distinctive triples for those nodes;

• deterministically label the resulting graph.

In the preliminaries we arranged a ready supply of meaningless
triples (with predicate c14n:true) that we can add to and delete
from the graph without modifying its meaning.

Thus, we can perform the following steps:

A. Perform a one-step deterministic labeling.

B. If there are no hard to label nodes, then stop.
[This step is to ensure the algorithm is idempotent.]

C. Delete all triples with predicate c14n:true.
[Without B, this would not be idempotent.]

D. Perform a one-step deterministic labeling.

E. Using a new lookup table from step D, and a new counter,
scan the file from top to bottom, performing these steps on
each line:

a. If there is a "~" in object position:

i. Extract the blank node identifier from the final
comment in the line. Remove the comment.

ii. Look the identifier up in the table.

iii. If there is no entry, add an entry to the table;
and add a new triple to the graph with subject
being the blank node identified by the identifier
from the comment, predicate being c14n:true
and object being the string form of the counter;
increment the counter.

b. If there is a "~" in subject position, use the same
subprocedure to possibly create a distinctive triple for
the subject as well.

F. Perform a one-step deterministic labeling (of the new
modified graph, with a new lookup table and counter). Since
all nodes participate in a distinctive triple, every table lookup
will find an entry.
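Step E can be sketched as a single scan over the labelled file. The sketch below assumes the line layout of section 7.1, with the trailing comment listing the subject identifier before the object identifier; it also assumes the counter starts at 1 and writes the predicate in its abbreviated form c14n:true. The function name distinctive_additions is hypothetical, and the sample lines are modelled on example 2.

```python
def distinctive_additions(labelled_lines, predicate="c14n:true"):
    """Step E: build one new triple per blank node still shown as '~'."""
    seen, additions, counter = set(), [], 1
    for line in labelled_lines:
        body, _, comment = line.partition(" # ")
        subject, _predicate, obj = body.strip()[:-1].strip().split(None, 2)
        idents = comment.split()
        # Assumed comment layout: subject identifier first, then object.
        subject_id = idents.pop(0) if subject == "~" else None
        object_id = idents.pop(0) if obj == "~" else None
        for ident in (object_id, subject_id):    # object first, as in step E
            if ident is not None and ident not in seen:
                seen.add(ident)
                additions.append('%s %s "%d" .' % (ident, predicate, counter))
                counter += 1
    return additions


labelled = [
    '<eg:b> <eg:prop> _:g1 .',
    '_:g2 <eg:prop> "val" .',
    '_:g2 <eg:zee> "why" .',
    '~ <eg:prop> "val" . # _:b3',
]
new_triples = distinctive_additions(labelled)
print(new_triples)
```

The new triples are expressed with the original blank node identifiers, so they would be added to the graph before the relabelling in step F.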



This cheating is non-deterministic in step E, but that non-
determinism is fairly limited, because even with the rather naive
one-step deterministic labeling algorithm almost all nodes in
almost all (practically occurring) RDF graphs will have been
classified.

So again continuing with example 2, we find:

<eg:b> <eg:prop> _:g1 .

_:g2 <eg:prop> "val" .
_:g2 <eg:zee> "why" .
_:g3 <eg:prop> "val" .

_:g3 <http://www-uk.hpl.hp.com/people/jjc/rdf/c14n#true> "1" .

An application scenario in which this is useable is that of signing
an OWL ontology [13]. The ontologist creates the ontology in a
tool, and then asks the tool to generate a signature using a private
key. The tool:

• applies this algorithm, possibly adding additional triples to
the ontology.

• creates a canonical RDF graph with the same meaning as the
original ontology.

• computes the signature for the canonical RDF graph.

• adds additional triple(s) to the graph with the ontology URI
as subject, and the signature as object.

The resulting graph (with additional triples both as a result of
canonicalization and reflecting the signature) then replaces the
original ontology. It is this graph (which can be canonicalized
without any changes) that the ontologist publishes. Users of this
ontology can then:

• find the signature in the graph.

• find the public key using some public key infrastructure.

• delete the triple(s) carrying the signature from the graph.

• apply the deterministic labeling to form a canonical
representation of the graph.

• verify the signature of this canonical RDF using the public
key.
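The publish and verify sequence can be illustrated with a short sketch. Several assumptions are made loudly here: a keyed HMAC from the Python standard library stands in for a real public-key signature, so a single shared key replaces the private/public key pair; the predicate <eg:signature> and the function names are hypothetical; and the input lines are assumed to be already canonically labelled, so that sorting is the only remaining canonicalization step.

```python
import hashlib
import hmac

SIGNATURE_PREDICATE = "<eg:signature>"    # hypothetical predicate


def canonical_bytes(lines):
    """Serialize already-labelled lines in lexicographic order."""
    return "\n".join(sorted(lines)).encode("utf-8")


def sign(lines, ontology_uri, key):
    """Add a triple carrying the signature of the canonical document."""
    digest = hmac.new(key, canonical_bytes(lines), hashlib.sha256).hexdigest()
    return lines + ['%s %s "%s" .' % (ontology_uri, SIGNATURE_PREDICATE, digest)]


def verify(published_lines, ontology_uri, key):
    """Find the signature triple, delete it, and check the remainder."""
    prefix = "%s %s " % (ontology_uri, SIGNATURE_PREDICATE)
    signature_lines = [l for l in published_lines if l.startswith(prefix)]
    rest = [l for l in published_lines if not l.startswith(prefix)]
    if len(signature_lines) != 1:
        return False
    claimed = signature_lines[0][len(prefix):].rstrip(" .").strip('"')
    expected = hmac.new(key, canonical_bytes(rest), hashlib.sha256).hexdigest()
    return hmac.compare_digest(claimed, expected)


canonical = [
    '<eg:b> <eg:prop> _:g1 .',
    '_:g2 <eg:zee> "why" .',
]
key = b"demo-key"                          # stand-in for the key material
published = sign(canonical, "<eg:onto>", key)
print(verify(list(reversed(published)), "<eg:onto>", key))
```

Reordering the published lines does not affect verification, whereas changing any triple other than the signature triple does.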

9.3 Further Discussion

The one-step deterministic labeling algorithm used had the
following characteristics:

• Reasonably fast (subquadratic)

• Deterministic

• Labels 'enough' blank nodes in realistic RDF

• Easy to explain and implement

The last point is helpful in a paper, but not a hard requirement.
Thus in a deployed system we can imagine using a variant of the
algorithm here.

It is a practical engineering problem to choose a near-optimal,
deterministic classification algorithm that gets the best trade off
between speed of classification and practical utility. The
deterministic labeling algorithm used in this paper is probably a
touch naive, but only a touch. We can be sure that those used by 
Maekay [22] are too expensive, given their non-polynomial 
complexity.

10. ADVANCED METHODS

10.1 Multistep Deterministic Labelling

The one-step deterministic labelling only considers the immediate
neighbours of any node. In some graphs, particularly those arising
from ontology definitions [13], this may leave too many nodes
unlabelled.

The solution is to choose a fixed depth, e.g. two or three, and to
consider the neighbours of a node up to that depth. Thus we
consider the two-step and three-step deterministic labeling
algorithms. Since for a fixed k a k-step method will involve
k applications of the one step method, we will expect a linear slow
down, O(n log n). (It is straightforward to optimize k-step methods to
make the slow down much less than linear.)

To consider neighbours at distance N+1 we resort the partial
results of a deterministic labeling based on neighbours at
distance N.

Having resorted the output, which can take into account those
labels already deterministically chosen, we can then apply the
one-step deterministic labeling algorithm, which effectively considers
neighbours at an additional step removed.

For example the full two-step deterministic labeling is as follows:

A. Perform a one step deterministic labeling.

B. Repeat steps 7, 8, 9 from the one pass deterministic labeling
algorithm, without initializing the table or counter.

For a k-step labeling we need to repeat B k-1 times.
Example 3 is a simple OWL expression, written in RDF/XML as:

<owl:Class>
<owl:oneOf rdf:parseType="Collection">
<rdf:Description eg:sugar="Dry"/>
<rdf:Description eg:sugar="OffDry"/>
<rdf:Description eg:sugar="Sweet"/>
</owl:oneOf>
</owl:Class>

The corresponding triples (using qnames as in the OWL Test
Cases [9]) are:

_:j0 rdf:type owl:Class .
_:j2 <eg:sugar> "Dry" .
_:j4 <eg:sugar> "OffDry" .
_:j6 <eg:sugar> "Sweet" .
_:j5 rdf:first _:j6 .
_:j5 rdf:rest rdf:nil .
_:j5 rdf:type rdf:List .
_:j3 rdf:first _:j4 .
_:j3 rdf:rest _:j5 .
_:j3 rdf:type rdf:List .
_:j1 rdf:first _:j2 .
_:j1 rdf:rest _:j3 .
_:j1 rdf:type rdf:List .
_:j0 owl:oneOf _:j1 .

Applying the one-step deterministic labeling fails to label one of
the nodes (still using qname syntax for URIs).

_:g0 <eg:sugar> "Dry" .
_:g1 <eg:sugar> "OffDry" .
_:g2 <eg:sugar> "Sweet" .
_:g3 rdf:first _:g2 .
_:g3 rdf:rest rdf:nil .
_:g3 rdf:type rdf:List .
_:g4 rdf:type owl:Class .
_:g4 owl:oneOf _:g5 .
_:g5 rdf:first _:g0 .
_:g5 rdf:rest ~ . # _:j3
_:g5 rdf:type rdf:List .
~ rdf:first _:g1 . # _:j3
~ rdf:rest _:g3 . # _:j3
~ rdf:type rdf:List . # _:j3

The two-step process does label the last node.
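The behaviour on example 3 can be reproduced with a small sketch of the k-step procedure: the initial sort is followed by k repetitions of steps 7, 8 and 9, sharing one table and counter. The function name k_step_label and its parameters are hypothetical, triples are assumed to be 3-tuples of tokens, and the abbreviated qnames are sorted as written, so the label numbers need not match a run over full URIs; only the set of labelled nodes is examined.

```python
from collections import Counter


def k_step_label(triples, k=1, first_label=1):
    """Return the lookup table (original id -> label) after k passes."""
    rows = [(["~" if t.startswith("_:") else t for t in triple], triple)
            for triple in set(triples)]

    def text(row):
        return " ".join(row[0]) + " ."

    table, counter = {}, first_label
    rows.sort(key=text)                          # step 6
    for _ in range(k):
        copies = Counter(text(row) for row in rows)
        for row in rows:                         # step 7
            terms, original = row
            if copies[text(row)] > 1:
                continue
            for pos in (2, 0):
                if terms[pos] == "~":
                    if original[pos] not in table:
                        table[original[pos]] = "_:g%d" % counter
                        counter += 1
                    terms[pos] = table[original[pos]]
        for terms, original in rows:             # step 8
            for pos in (0, 2):
                if terms[pos] == "~" and original[pos] in table:
                    terms[pos] = table[original[pos]]
        rows.sort(key=text)                      # step 9
    return table


example3 = [
    ("_:j0", "rdf:type", "owl:Class"), ("_:j0", "owl:oneOf", "_:j1"),
    ("_:j1", "rdf:type", "rdf:List"), ("_:j1", "rdf:first", "_:j2"),
    ("_:j1", "rdf:rest", "_:j3"), ("_:j2", "<eg:sugar>", '"Dry"'),
    ("_:j3", "rdf:type", "rdf:List"), ("_:j3", "rdf:first", "_:j4"),
    ("_:j3", "rdf:rest", "_:j5"), ("_:j4", "<eg:sugar>", '"OffDry"'),
    ("_:j5", "rdf:type", "rdf:List"), ("_:j5", "rdf:first", "_:j6"),
    ("_:j5", "rdf:rest", "rdf:nil"), ("_:j6", "<eg:sugar>", '"Sweet"'),
]
all_blank = {"_:j%d" % i for i in range(7)}
one_step = k_step_label(example3, k=1)
two_step = k_step_label(example3, k=2)
print(sorted(all_blank - set(one_step)), sorted(all_blank - set(two_step)))
```

After one pass only _:j3, the middle list node, remains without a label; the second pass sees lines that now mention the labels of its neighbours and so can label it.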

This paper has little to do with optimizing the algorithms
presented, but concentrates on the complexity class of the
algorithms.

Systems would benefit from merging the steps described
herein to reduce the total number of passes of the graph
required.

Also, given that most canonicalization is for objects small
enough to fit in memory, the description in terms of sorting files is
probably misleading. In-memory structures such as b-trees [1]
could be used, and maintaining them in sort order may be more
efficient than permitting them to become unsorted and resorting
them as a separate algorithmic step.

The multistep methods in particular are open to
optimization. Often the first stage will completely label all the blank
nodes of the graph. The subsequent stages are then redundant.
Moreover, during the first stage we can identify those parts of
the graph that require further work. The ones that are
finished play no further role in the algorithm, except being
included, in a final merge, into the output.

10.3 Multiple Arcs in a Single Comparison 

The methods presented have been constrained, for simplicity, to
those that can be implemented using Unix sort and awk over N-
Triples.

On large graphs it may be the case that there are nodes which do
not participate in distinctive triples, but for which the set of triples
in which they participate is distinctive. A simple example is:

_:a <eg:eat> "apple" .
_:a <eg:eat> "pear" .
_:b <eg:eat> "apple" .
_:b <eg:eat> "banana" .
_:c <eg:eat> "pear" .
_:c <eg:eat> "banana" .

None of the triples is distinctive, but _:a is the only blank node
eating both apples and pears.

It may be advantageous to collect all the triples in which a node
participates and use the set of labels on those triples as a label in an
attempt to distinguish the node.
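One way to read this suggestion is sketched below. The notion of an arc signature (the set of incoming and outgoing arcs of a node, with neighbouring blank nodes masked as "~") and the function names are assumptions made for illustration, as is the representation of triples as 3-tuples of tokens.

```python
from collections import Counter, defaultdict


def arc_signatures(triples):
    """For each blank node, the set of its arcs with blank nodes masked."""
    def mask(term):
        return "~" if term.startswith("_:") else term
    arcs = defaultdict(set)
    for s, p, o in triples:
        if s.startswith("_:"):
            arcs[s].add(("out", p, mask(o)))
        if o.startswith("_:"):
            arcs[o].add(("in", p, mask(s)))
    return {node: frozenset(found) for node, found in arcs.items()}


def distinguished_by_arcs(triples):
    """Blank nodes whose arc signature is shared with no other node."""
    signatures = arc_signatures(triples)
    copies = Counter(signatures.values())
    return {node for node, sig in signatures.items() if copies[sig] == 1}


eaters = [
    ("_:a", "<eg:eat>", '"apple"'), ("_:a", "<eg:eat>", '"pear"'),
    ("_:b", "<eg:eat>", '"apple"'), ("_:b", "<eg:eat>", '"banana"'),
    ("_:c", "<eg:eat>", '"pear"'), ("_:c", "<eg:eat>", '"banana"'),
]
print(sorted(distinguished_by_arcs(eaters)))
```

In the example each of the three nodes eats a different pair of fruits, so all three signatures are unique even though no single triple is distinctive.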

11. PRACTICAL RDF C14N 
11.1 Problems concerning rdf:Collection

In example 3 we saw that lists with two or more elements created
with the rdf:parseType="Collection" syntax [3] may
create a number of blank nodes which:



• Are the subject of a triple with predicate rdf:type, and
object rdf:List.

• Are the subject of a triple with predicate rdf:first and
object being a blank node.

• Are the subject of a triple with predicate rdf:rest and
object being a blank node.

• Are the object of a triple with predicate rdf:rest and
subject being a blank node.

Since none of these triples is distinctive, and the node does not
participate in any others, the one pass deterministic labeling will
not label such nodes. Moreover the multiple arc label suggested
above (section 10.3) does not help.

Given the importance of OWL in the semantic web, and the desire
to canonicalize and digitally sign ontologies, this is a
disadvantage with one-step methods. Moreover, additional
distinctive arcs on such nodes, added as a result of the non-
deterministic cheating algorithm, would cause the resulting graph
to be one that cannot use the compact
rdf:parseType="Collection" syntax.

Fortunately the multistep methods will typically distinguish the
blank nodes in a typical OWL ontology (possibly requiring the
multiple arc label method). This is because while the elements of
the list may all be blank, for example a list of restrictions, enough
of these elements will participate in distinctive triples, for
example, distinctive owl:hasValue triples. When adding
further distinctive triples it is best to try and exploit the multistep
canonicalization and to avoid adding them to the blank nodes
carrying the list structure, but instead to add them to the members
of the list (e.g. to a blank node of type owl:Restriction).

Further study is needed, studying ontologies in say the DAML
library [12] and trying a variety of canonicalization techniques
over them, to see which work best. If none of these methods
proves sufficiently satisfactory, it would be possible to have
special treatment of the rdf:List construct (for example adding
arcs showing the distance between list members, before
canonicalization).

11.2 Disadvantages of Graph Modification

The nondeterministic algorithm potentially triples the size of the
graph (worst case). This paper is predicated on that potential
occurrence not actually happening. The deterministic part of the
algorithm will label almost all the nodes in a typical RDF graph.

Both of the cheating methods (deletion or addition) change the
RDF graph. In this sense the algorithms presented are not
canonicalization algorithms. However, we have argued that for
practical applications these changes are acceptable, as long as:
they happen relatively rarely; they are expected; and the
choice between the deleting method or the adding method has
been made appropriately for the application.

A further difficulty is found when canonicalizing multiple subsets
of an RDF graph, using the nondeterministic method. The
required changes for one such canonicalization will most likely be
incompatible with the required changes for another. This can be
addressed by using super-properties of c14n:true to create the
distinctive triples.

12. COMPLEXITY 

All the steps in this paper either involve lexicographically sorting
a file, for which the best known algorithms are O(n log n) [10], or
involve stepping through a file line by line and doing either the
same simple modification, or a modification involving a table
lookup (e.g. step b.ii in section 7). The former steps are O(n), the
latter O(n log n) (table lookup is O(log n), see [1]). Since any
particular variant of the techniques involves a constant number of
such sort or modification steps the overall complexity is O(n log n)
where n is the number of triples in the RDF graph.

13. TEST DATA

To see how much nondeterminism is needed to canonicalize
typical RDF data we ran the algorithms over the following four
sources:

Source                          1     2     3     *
RDF test data [16]            130     0     0     0
OWL test data [9]              45     2     0     0
OWL guide [25]                  0     0     0     1
DAML ontology library [12]     93    10     5    40

The last line shows that of the files tested from the DAML
ontology library 93 were deterministically labeled by the one-step
algorithm. A further 10 were deterministically labeled using the
two-step algorithm, but 40 were not fully labeled even with
the three step algorithm. The first two lines show that the N-
Triples files included with test cases for RDF and OWL are not
sufficiently challenging to be good tests for canonicalization.

Thus, at least for ontologies, it appears that direct deterministic
methods are not sufficient, and that nondeterministic techniques
such as described in section 9.2 are practically necessary.

14. THE FUTURE 

The techniques of this paper are not particularly useful in a single
stand-alone software product.

They only will be useful as part of a society (a social, legal and
technical framework) in which there is widespread use of digital
signatures, and wide agreement about what gets signed, and what
legal and social obligations such signatures convey.

This agreement will also detail the exact forms of canonicalization
used, which will not be precisely what has been articulated in this
paper.

Hence, this paper is largely irrelevant unless it feeds into a 
standardization process. 

Further research could be directed towards the fairly practical
question of, given RDF data such as we find today, which
deterministic classification algorithms are particularly adept, i.e.
leaving few nodes unclassified.

Once there is a well-researched answer to this question, then
standardizing the techniques of this paper should be
straightforward.

CLAIMS

1. A method of processing data comprising the steps of:

processing data in accordance with a first set of rules, which operate, inter alia, to
define a stage at which such a processing operation ceases;

applying to the partly-processed data a second set of rules, which operate to modify
the data, so that the modified data may be processed in accordance with a third set of
rules.

2. A method according to claim 1 wherein the first and third sets of rules are the same.

3. A method according to claim 1 or claim 2 wherein the modification in accordance
with the second set of rules modifies the data in a significant manner.

4. A method according to any one of the preceding claims wherein the modification in
accordance with the second set of rules modifies the data such that the modified data is
processable by the third set of rules.

5. A method according to any one of the preceding claims wherein the data is
graphically represented data.

6. A method according to claim 5 any one of the preceding claims wherein the data is 
an RDF graph. 



7. A method according to any one of the preceding claims further comprising the step,
performed prior to processing of the data in accordance with the first set of rules, of
modifying the data in an insignificant manner.

8. A method according to any one of the preceding claims, wherein the significant
modifications include the deletion of significant data.

9. A method according to any one of the preceding claims wherein the significant
modifications include the addition of significant data.

10. A method according to claim 9 wherein the significant additions are distinguishable
from data which is, prior to performance of any modifications, significant.



11. A method according to any one of the preceding claims wherein the data describes an
ontology.

12. A method according to any one of the preceding claims further comprising the step
of processing the data in accordance with the third set of rules.

13. A method according to claim 12, further comprising the step, subsequent to the
processing of the data in accordance with the third set of rules, of writing or verifying a
digital signature establishing authenticity of the data.

14. A method according to any one of the preceding claims, wherein reapplying the
method of any one of the preceding claims to data processed in accordance with such a
method does not result in any further modification of the data.